An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads

Authors

  • Nachiket Kapre
  • André DeHon
Abstract

Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly structured communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements for the BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.

Disciplines: Electrical and Computer Engineering | Engineering | Systems Engineering

This journal article is available at ScholarlyCommons: http://repository.upenn.edu/ese_papers/702

Hindawi Publishing Corporation, International Journal of Reconfigurable Computing, Volume 2011, Article ID 745147, 14 pages, doi:10.1155/2011/745147

Research Article

Nachiket Kapre (1) and André DeHon (2)
(1) Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, UK
(2) Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA

Correspondence should be addressed to Nachiket Kapre, [email protected]

Received 28 August 2010; Accepted 14 December 2010

Academic Editor: Michael Hübner

Copyright © 2011 N. Kapre and A. DeHon. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
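
The abstract's central idea, that communication structure is known before runtime because messages follow graph edges, can be illustrated with a small sketch. This is not the authors' compiler: the function name `traffic_per_pe`, the ring graph, and both placements are hypothetical and chosen only to show how an offline tool could count the NoC messages implied by a given node-to-PE placement in one BSP superstep.

```python
from collections import Counter


def traffic_per_pe(edges, placement):
    """Count messages each PE must send over the NoC in one BSP superstep.

    edges: iterable of (src, dst) graph edges; assume one message per edge.
    placement: dict mapping node -> PE id (hypothetical, for illustration).
    Messages between nodes on the same PE stay local and are not counted.
    """
    sent = Counter()
    for src, dst in edges:
        if placement[src] != placement[dst]:
            sent[placement[src]] += 1
    return sent


# A 6-node ring graph mapped onto 2 PEs.
edges = [(i, (i + 1) % 6) for i in range(6)]

interleaved = {n: n % 2 for n in range(6)}  # neighbours always on different PEs
clustered = {n: n // 3 for n in range(6)}   # neighbours mostly on the same PE

interleaved_traffic = traffic_per_pe(edges, interleaved)
clustered_traffic = traffic_per_pe(edges, clustered)

# Total NoC messages per superstep under each placement.
print(sum(interleaved_traffic.values()), sum(clustered_traffic.values()))
```

Because the edge list is fixed, both totals can be computed entirely offline; a placement that keeps neighbouring nodes on the same PE sends fewer messages across the network while still spreading them evenly across the two PEs.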

Similar resources

An NoC Traffic Compiler for efficient FPGA implementation of Parallel Graph Applications

Parallel graph algorithms expressed in a Bulk-Synchronous Parallel (BSP) compute model generate highly structured communication workloads from messages propagating along graph edges. We can expose this structure to traffic compilers and optimization tools before runtime to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimizatio...

Ultra-Fast and Accurate Simulation for Large-Scale Many-Core Processors

Many-core processor architectures are becoming mainstream. With the many-core trend, Network-on-Chip (NoC) has become the de facto on-chip communication fabric, replacing traditional bus-based architectures. Targeting thousands of cores in near future many-core architectures, large-scale NoC designs need to be modeled and evaluated fast and accurately to understand their performance characteris...

Efficient Routing Implementation of Programmable Network on Chip on FPGA using Circuit Switching Approach

More and more complex and larger system on chips are getting developed as a result of increase in chip density following Moore’s law. Advanced SoCs have in their shelf significantly noticeable communication mechanisms. NoC has solved the scalability problems to a larger extent compared to bus based interconnect. NoC has been providing a back bone infrastructure for System-on-chips since long. A...

FPGA Hardware Implementation and Evaluation of a Micro-Network Architecture for Multi-Core Systems

This paper presents the design, implementation and evaluation of a micro-network, or Network-on-Chip (NoC), based on a generic pipeline router architecture. The router is designed to efficiently support traffic generated by multimedia applications on embedded multi-core systems. It employs a simplest routing mechanism and implements the round-robin scheduling strategy to resolve output port con...

Realistic Workload Characterization and Analysis for Networks-on-Chip Design

As silicon device scaling trends have simultaneously increased transistor density while reducing component costs, architectures incorporating multiple communicating components are becoming more common. In these systems, networks-on-chip (NOCs) connect the components for communication and NOC design is critical to the performance and efficiency of the system. Typically, in NOC design traditional...

Journal title:
  • Int. J. Reconfig. Comp.

Volume 2011  Issue 

Pages  -

Publication date 2011